Week 3: Causes, Confounds, and Colliders

Good and bad controls

finding backdoor paths

  1. Start at treatment (X)

  2. Look for any arrows coming INTO X

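  3. From each of those arrows, follow every possible path to the outcome (Y); these are the backdoor paths

  4. A valid adjustment set blocks all backdoor paths

  5. But be careful not to control for colliders! Conditioning on a collider opens a path that was otherwise closed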
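
The steps above can be sketched in base R. This is only an illustration under stated assumptions: the DAG is a hypothetical one (U confounds X and Y, and C is a collider caused by both), and the names `edges`, `all_paths`, and `has_collider` are made up for this sketch. They are not part of dagitty or of the course materials.

```r
# Hypothetical DAG: U -> X, U -> Y, X -> Y, X -> C, Y -> C
edges <- data.frame(
  from = c("U", "U", "X", "X", "Y"),
  to   = c("X", "Y", "Y", "C", "C"),
  stringsAsFactors = FALSE
)

# Steps 1 and 3: list every path from X to Y, ignoring arrow direction
all_paths <- function(start, end, visited = start) {
  if (start == end) return(list(visited))
  neighbours <- unique(c(edges$to[edges$from == start],
                         edges$from[edges$to == start]))
  out <- list()
  for (nb in setdiff(neighbours, visited)) {
    out <- c(out, all_paths(nb, end, c(visited, nb)))
  }
  out
}
paths_xy <- all_paths("X", "Y")

# Step 2: a backdoor path begins with an arrow pointing INTO X
is_backdoor <- sapply(paths_xy, function(p) {
  any(edges$from == p[2] & edges$to == "X")
})

# Step 5: a path that contains a collider (-> node <-) is already blocked
has_collider <- function(p) {
  if (length(p) < 3) return(FALSE)
  any(sapply(2:(length(p) - 1), function(i) {
    any(edges$from == p[i - 1] & edges$to == p[i]) &&
      any(edges$from == p[i + 1] & edges$to == p[i])
  }))
}

path_table <- data.frame(
  path     = sapply(paths_xy, paste, collapse = " - "),
  backdoor = is_backdoor,
  collider = sapply(paths_xy, has_collider)
)
print(path_table)
```

In this hypothetical DAG the only backdoor path is X - U - Y, so adjusting for U blocks it (step 4). The path through C is closed until C is conditioned on, which is why step 5 warns against controlling for it.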

exercise

Go to dagitty.net and create this DAG:

exercise

  1. List all paths between Exercise and Health
  2. Identify which paths are backdoor paths
  3. Find all valid adjustment sets if we want to estimate the effect of Exercise on Health
  4. BONUS: What happens if we control for Motivation? Why?

exercise: simulate simple confounding

Copy the code below.

  1. Run the base simulation and observe results

  2. Modify the simulation parameters:

  • Change the strength of the confounding (modify the 0.5, 0.8, and 0.6 coefficients)
  • Change the sample size (N)
  • Add a true causal effect (modify Y calculation to include X)
  3. Answer these questions:
  • What happens to the bias in the naive estimate as you increase the strength of confounding?
  • How does sample size affect the precision of your estimates?
  • When does controlling for Z fail to recover the true causal effect?
Code
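library(brms)       # brm(), prior(), posterior_summary()
library(tidyverse)  # %>%, pivot_longer(), ggplot()

set.seed(9)
# sample size
N = 1000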
# Generate data
U <- rnorm(N)  # Unobserved confounder
X <- rnorm(N, mean = 0.5 * U)  # Treatment affected by U
Y <- rnorm(N, mean = 0.8 * U)  # Outcome affected by U
Z <- rnorm(N, mean = 0.6 * U)  # Observed variable that captures U

d <- data.frame(X, Y, Z)

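# The same naive model written as a rethinking-style formula list,
# shown for reference; the brms fits below do not use it
flist1 <- alist(
  Y ~ dnorm(mu, sigma),
  mu <- a + bX*X,
  a ~ dnorm(0, .5),
  bX ~ dnorm(0, .25),
  sigma ~ dexp(1)
)

# Fit models
# Naive model: Y regressed on X only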
m1 <- brm(
  data=d,
  family=gaussian,
  Y ~ 1 + X,
  prior = c( prior(normal(0, .50), class=Intercept),
             prior(normal(0, .25), class=b),
             prior(exponential(1), class=sigma)),
  iter=2000, warmup=1000, seed=3, chains=1
)

posterior_summary(m1)


# Adjusted model: Y regressed on X and the observed proxy Z
m2 <- brm(
  data=d,
  family=gaussian,
  Y ~ 1 + X + Z,
  prior = c( prior(normal(0, .50), class=Intercept),
             prior(normal(0, .25), class=b),
             prior(exponential(1), class=sigma)),
  iter=2000, warmup=1000, seed=3, chains=1
)

posterior_summary(m2)

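# Extract posterior draws from both models
post.1 <- as_draws_df(m1)
post.2 <- as_draws_df(m2)

# Compare the posterior of the X coefficient with and without adjusting for Z
results_df = data.frame(naive    = post.1$b_X,
                        adjusted = post.2$b_X)
results_df %>%
  pivot_longer(everything()) %>%
  ggplot(aes(x = value, fill = name)) +
  geom_density(alpha = .5) +
  # Dashed line marks the true effect of X on Y (0 in the base simulation)
  geom_vline(aes(xintercept = 0), linetype = "dashed")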
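
To make step 2 easier to repeat, the simulation can be wrapped in a function. The following is a minimal sketch, not the course's method: `sim_confound` and its argument names are hypothetical, and it swaps the brms models for `lm()` so that each run takes a fraction of a second. Because `lm()` applies no priors, its point estimates will not match the posterior means exactly.

```r
# Hypothetical helper: simulate the confounded data and return both estimates.
# b_ux, b_uy, b_uz correspond to the 0.5, 0.8, and 0.6 coefficients above;
# b_xy is the true causal effect of X on Y (0 in the base simulation).
sim_confound <- function(N = 1000, b_ux = 0.5, b_uy = 0.8, b_uz = 0.6,
                         b_xy = 0, seed = 9) {
  set.seed(seed)
  U <- rnorm(N)                              # unobserved confounder
  X <- rnorm(N, mean = b_ux * U)             # treatment
  Y <- rnorm(N, mean = b_uy * U + b_xy * X)  # outcome
  Z <- rnorm(N, mean = b_uz * U)             # observed proxy for U
  c(naive    = unname(coef(lm(Y ~ X))["X"]),
    adjusted = unname(coef(lm(Y ~ X + Z))["X"]))
}

print(sim_confound())                        # base simulation
print(sim_confound(b_ux = 1, b_uy = 1.5))    # stronger confounding
print(sim_confound(N = 100))                 # smaller sample
print(sim_confound(b_xy = 0.3))              # add a true causal effect
```

Changing `seed` as well gives a feel for how much the estimates move from one simulated dataset to the next.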

bad controls

“Bad controls” can damage a causal estimate in three main ways:

  • Collider bias (as we saw in the previous exercise)
  • Precision parasites (reduce precision without addressing confounding)
  • Bias amplification (making existing bias worse)

Warning signs of bad controls:

  • Post-treatment variables
  • Variables affected by both treatment and outcome
  • Variables that don’t address actual confounding paths

exercise

Use this code to simulate new variables:

n = 100
# Z affects X but is not a confounder
Z <- rnorm(n)
X <- rnorm(n, mean = Z)
Y <- rnorm(n, mean = X)  # True effect of X on Y is 1

Using different sample sizes (n = 50, 100, 1000), test two models exploring the relationship between X (exposure) and Y (outcome). For each sample size, compare:

  • Standard errors without controlling for Z
  • Standard errors when controlling for Z

How does sample size affect the impact of the precision parasite (Z)? Under what conditions is the precision loss most severe?
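
One way to organize the comparison is a small helper that returns both standard errors for a given sample size. This is a sketch under assumptions: `se_compare` is a hypothetical name, the seed is arbitrary, and `lm()` is used in place of a brms model to keep the loop fast.

```r
# Hypothetical helper: standard error of the X coefficient
# with and without the precision parasite Z in the model.
se_compare <- function(n, seed = 1) {
  set.seed(seed)
  Z <- rnorm(n)              # affects X but is not a confounder
  X <- rnorm(n, mean = Z)
  Y <- rnorm(n, mean = X)    # true effect of X on Y is 1
  se_x <- function(fit) summary(fit)$coefficients["X", "Std. Error"]
  c(n            = n,
    se_without_Z = se_x(lm(Y ~ X)),
    se_with_Z    = se_x(lm(Y ~ X + Z)))
}

# One row per sample size
se_table <- t(sapply(c(50, 100, 1000), se_compare))
print(se_table)
```

A single seed gives a single dataset per sample size, so it is worth rerunning with a few different seeds before drawing conclusions.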

exercise

Use this code to simulate new variables:

n = 100
conf_strength = 1
# U is unmeasured confounder
U <- rnorm(n)
Z <- rnorm(n)
X <- rnorm(n, mean = Z + conf_strength * U)
Y <- rnorm(n, mean = conf_strength * U)  # No true effect of X

Using different confounder strengths (0.5, 1, 2), test two models exploring the relationship between X (exposure) and Y (outcome). For each confounder strength, compare:

  • The estimated X coefficient and its standard error without controlling for Z
  • The estimated X coefficient and its standard error when controlling for Z

Questions:

  • What happens to the bias when you control for Z?
  • How does the strength of the confounding affect the amount of bias amplification?
  • Can you explain why this happens using the DAG?
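
The comparison can be organized with a helper like the one below. It is a sketch, not a prescribed solution: `bias_compare` is a hypothetical name, the seed is arbitrary, and `lm()` stands in for a brms model. Because the questions above ask about bias, the sketch reports the estimated X coefficient rather than its standard error. The true effect of X on Y is 0 in this simulation, so the estimate itself is the bias.

```r
# Hypothetical helper: X coefficient with and without Z in the model
# for a given confounder strength. The true effect of X on Y is 0.
bias_compare <- function(conf_strength, n = 100, seed = 1) {
  set.seed(seed)
  U <- rnorm(n)                                # unmeasured confounder
  Z <- rnorm(n)                                # affects X only
  X <- rnorm(n, mean = Z + conf_strength * U)
  Y <- rnorm(n, mean = conf_strength * U)      # no true effect of X
  c(conf_strength = conf_strength,
    b_without_Z   = unname(coef(lm(Y ~ X))["X"]),
    b_with_Z      = unname(coef(lm(Y ~ X + Z))["X"]))
}

# One row per confounder strength
bias_table <- t(sapply(c(0.5, 1, 2), bias_compare))
print(bias_table)
```

With n = 100 the estimates are noisy; passing a larger `n` makes the pattern across confounder strengths easier to see.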

exercise

Use the simulation code provided in the last two exercises to create a new scenario with both a precision parasite variable (Z1) and a bias amplification variable (Z2).

Questions:

  • What happens to our estimates when we control for both variables?

  • Is it better to:

    • Control for neither
    • Control for just one (which one?)
    • Control for both
  • How can we use DAGs to decide which controls to include?

table 2 fallacy

The table 2 fallacy, first described by Westreich and Greenland in 2013, refers to a common misinterpretation in epidemiology and statistics when researchers present multiple adjusted effect estimates in a single table (often “Table 2” in academic papers).

The fallacy occurs when researchers interpret all coefficients in a multiple regression model as total effects, when in fact some are direct effects conditional on the other variables in the model. This can lead to incorrect causal interpretations, particularly when some variables are mediators (lying on the causal pathway between exposure and outcome).

For example, imagine studying how education affects income, with job type as a mediator:

  • Education → Job Type → Income
  • Education also directly affects Income

If you include both education and job type in the same regression model, the coefficient for education represents only its direct effect on income (not mediated through job type), not its total effect. However, researchers often mistakenly interpret it as the total effect.
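
A short simulation can make the distinction concrete. Everything here is assumed for illustration: the coefficients are made up, job type is treated as a continuous score for simplicity, and `lm()` is used for speed.

```r
# Hypothetical simulation of Education -> Job Type -> Income,
# plus a direct Education -> Income arrow. All coefficients are made up.
set.seed(1)
n <- 10000
education <- rnorm(n)
job_type  <- rnorm(n, mean = 0.5 * education)                    # E -> J
income    <- rnorm(n, mean = 0.3 * education + 0.6 * job_type)   # E -> I, J -> I

# Total effect of education: leave the mediator out of the model
total_effect <- unname(coef(lm(income ~ education))["education"])

# Direct effect of education: condition on the mediator
direct_effect <- unname(coef(lm(income ~ education + job_type))["education"])

# By construction: direct = 0.3, total = 0.3 + 0.5 * 0.6 = 0.6
print(round(c(total = total_effect, direct = direct_effect), 2))
```

Both numbers are correct answers, but to different questions. Reading the education coefficient from the second model as a total effect would understate it by the part that flows through job type.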

This fallacy becomes particularly problematic when:

  1. The research question involves understanding total causal effects
  2. There are multiple pathways between variables
  3. Some variables act as both confounders and mediators

To avoid this fallacy, researchers should:

  • Clearly specify which effects (total vs direct) they’re interested in
  • Use appropriate methods like path analysis or mediation analysis when studying causal relationships
  • Be precise in their interpretation of regression coefficients
  • Consider creating separate models for different research questions

exercise

Visit https://dagitty.net/learn/graphs/table2-fallacy.html and scroll down to the first Test Your Knowledge section. Try this exercise as many times as you need to get the correct answer twice in a row.

exercise

  • Draw out the DAG model for this research question using dagitty.net. (You can model “personality” instead of C, N, and their interaction.)

  • Assuming there are no unobserved confounds, which of these coefficients are total effects and which are direct effects?

  • If personality is the primary exposure variable, are there any covariates in here that should not be included?

  • Are there any unobserved confounds?